Factors Influencing the Surprising Instability of Word Embeddings
Despite the recent popularity of word embedding methods, there is only a
small body of work exploring the limitations of these representations. In this
paper, we consider one aspect of embedding spaces, namely their stability. We
show that even relatively high frequency words (100-200 occurrences) are often
unstable. We provide empirical evidence for how various factors contribute to
the stability of word embeddings, and we analyze the effects of stability on
downstream tasks.
Comment: NAACL HLT 2018
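To give a concrete feel for what "stability" means here, the sketch below quantifies it as the overlap between a word's nearest neighbours in two embedding spaces (for example, spaces trained with different random seeds or on different data samples). This is a minimal illustration of the general idea, not necessarily the paper's exact metric; the toy vocabulary and random vectors are stand-ins for real trained embeddings.

```python
# Minimal sketch: stability of a word as the percent overlap between its
# top-k nearest neighbours in two embedding spaces. Illustrative only.
import numpy as np

def nearest_neighbours(word, vocab, vectors, k=10):
    """Return the k most cosine-similar words to `word` (excluding itself)."""
    idx = vocab.index(word)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]

def stability(word, vocab, vectors_a, vectors_b, k=10):
    """Percent overlap of the word's top-k neighbours across two spaces."""
    nn_a = set(nearest_neighbours(word, vocab, vectors_a, k))
    nn_b = set(nearest_neighbours(word, vocab, vectors_b, k))
    return 100.0 * len(nn_a & nn_b) / k

# Toy usage: random vectors stand in for two independently trained spaces.
vocab = ["cat", "dog", "car", "truck", "apple", "banana"]
rng = np.random.default_rng(0)
space_a = rng.normal(size=(len(vocab), 50))
space_b = rng.normal(size=(len(vocab), 50))
print(stability("cat", vocab, space_a, space_b, k=3))
```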
SLATE: A Super-Lightweight Annotation Tool for Experts
Many annotation tools have been developed, covering a wide variety of tasks
and providing features like user management, pre-processing, and automatic
labeling. However, all of these tools use Graphical User Interfaces, and often
require substantial effort to install and configure. This paper presents a new
annotation tool that is designed to fill the niche of a lightweight interface
for users with a terminal-based workflow. Slate supports annotation at
different scales (spans of characters, tokens, and lines, or a document) and of
different types (free text, labels, and links), with easily customisable
keybindings, and unicode support. In a user study comparing with other tools it
was consistently the easiest to install and use. Slate fills a need not met by
existing systems, and has already been used to annotate two corpora, one of
which involved over 250 hours of annotation effort.
Comment: To appear at ACL as a demo
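To make the supported annotation scales and types concrete, here is a purely hypothetical data model covering spans, labels, free-text notes, and links. It is an expository sketch only, not SLATE's actual file format or API.

```python
# Hypothetical data model for span-based annotation at several scales,
# with categorical labels, free text, and links between spans.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Span:
    """A region of a document: character, token, line, or whole-document scale."""
    scale: str                 # "character" | "token" | "line" | "document"
    start: Tuple[int, int]     # (line, offset) where the span starts
    end: Tuple[int, int]       # (line, offset) where the span ends

@dataclass
class Annotation:
    """A label or free-text note on a span, optionally linked to another span."""
    span: Span
    label: Optional[str] = None    # categorical label, e.g. "PER"
    text: Optional[str] = None     # free-text comment
    link: Optional[Span] = None    # e.g. a coreference link to another span

# Example: mark tokens 3-4 of line 0 as a person mention.
mention = Annotation(span=Span(scale="token", start=(0, 3), end=(0, 4)), label="PER")
print(mention)
```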
Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection
Linguistically diverse datasets are critical for training and evaluating
robust machine learning systems, but data collection is a costly process that
often requires experts. Crowdsourcing the process of paraphrase generation is
an effective means of expanding natural language datasets, but there has been
limited analysis of the trade-offs that arise when designing tasks. In this
paper, we present the first systematic study of the key factors in
crowdsourcing paraphrase collection. We consider variations in instructions,
incentives, data domains, and workflows. We manually analyzed paraphrases for
correctness, grammaticality, and linguistic diversity. Our observations provide
new insight into the trade-offs between accuracy and diversity in crowd
responses that arise as a result of task design, providing guidance for future
paraphrase generation procedures.
Comment: Published at ACL 2017
Using Paraphrases to Study Properties of Contextual Embeddings
We use paraphrases as a unique source of data to analyze contextualized
embeddings, with a particular focus on BERT. Because paraphrases naturally
encode consistent word and phrase semantics, they provide a unique lens for
investigating properties of embeddings. Using the Paraphrase Database's
alignments, we study words within paraphrases as well as phrase
representations. We find that contextual embeddings effectively handle
polysemous words, but give synonyms surprisingly different representations in
many cases. We confirm previous findings that BERT is sensitive to word order,
but find slightly different patterns than prior work in terms of the level of
contextualization across BERT's layers.
Comment: Published at NAACL 2022
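As a concrete illustration of the kind of comparison described above, the sketch below extracts a word's contextual BERT representation from each of two paraphrases and measures their cosine similarity with the Hugging Face transformers library. The model name, subword pooling choice, and example sentences are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: compare the contextual embedding of a shared word across two
# paraphrases using BERT's final hidden layer (mean over subword pieces).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence, word):
    """Mean of the final-layer vectors for the subword pieces of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    target = sentence.lower().split().index(word)   # word position in the sentence
    piece_idx = [i for i, w in enumerate(enc.word_ids(0)) if w == target]
    return hidden[piece_idx].mean(dim=0)

s1 = "the movie was surprisingly good"
s2 = "the film was surprisingly good"
v1 = word_embedding(s1, "surprisingly")
v2 = word_embedding(s2, "surprisingly")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```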